Abstract

This paper addresses unsupervised discovery and localization of
dominant objects from a noisy image collection with multiple object
classes. The setting of this problem is fully unsupervised, without
even image-level annotations or any assumption of a single dominant
class. This is far more general than typical colocalization,
cosegmentation, or weakly-supervised localization tasks. We tackle the
discovery and localization problem using a part-based region matching
approach: We use off-the-shelf region proposals to form a set of
candidate bounding boxes for objects and object parts. These regions
are efficiently matched across images using a probabilistic Hough
transform that evaluates the confidence for each candidate
correspondence considering both appearance and spatial consistency.
Dominant objects are discovered and localized by comparing the scores
of candidate regions and selecting those that stand out over other
regions containing them. Extensive experimental evaluations on standard
benchmarks demonstrate that the proposed approach significantly
outperforms the current state of the art in colocalization, and
achieves robust object discovery in challenging mixed-class datasets.


1. Introduction 

Object localization and detection is highly challenging because of
intra-class variations, background clutter, and occlusions present in
real-world images. While significant progress has been made in this
area over the last decade, as shown by recent benchmark results
[11, 16], most state-of-the-art methods still rely on strong
supervision in the form of manually-annotated bounding boxes on target
instances. Since those detailed annotations are expensive to acquire
and also prone to unwanted biases and errors, recent work has explored
the problem of weakly-supervised object discovery where instances of an
object class are found in a collection of images without any box-level
annotations. Typically, weakly-supervised localization
[9, 35, 36, 43, 45, 56] requires positive and negative image-level
labels for a target object class. On the other hand, cosegmentation
[25, 29, 40] and colocalization [12, 27, 51] assume less supervision
and only require the image collection to contain a single dominant
object class, allowing noisy images to some degree.

Figure 1. Unsupervised object discovery in the wild. We tackle object
localization in an unsupervised scenario without any type of
annotations, where a given image collection may contain multiple
dominant object classes and even outlier images. The proposed method
discovers object instances (red bounding boxes) with their distinctive
parts (smaller boxes). (Best viewed in color.)

This paper addresses unsupervised object localization in a far more
general scenario where a given image collection contains multiple
dominant object classes and even noisy images without any target
objects. As illustrated in Fig. 1, the setting of this problem is fully
unsupervised, without any image-level annotations, an assumption of a
single dominant class, or even a known number of object classes. In
spite of this generality, the proposed method markedly outperforms the
state of the art in colocalization [27, 51] on standard benchmarks
[16, 40], and closely competes with current weakly-supervised
localization [9, 43, 56].

We advocate a part-based matching approach to unsupervised object
discovery using bottom-up region proposals. Multi-scale region
proposals have been widely used before to restrict the search space for
object bounding boxes in object recognition [9, 20, 53] and
localization [9, 27, 51, 54]. We go further and propose here to use
these regions to form a set of candidate regions not only for objects,
but also for object parts. We use a probabilistic Hough transform [2]
to match those candidate regions across images, and assign them
confidence scores reflecting both appearance and spatial consistency.
This can be seen as an unsupervised and efficient variant of both
deformable part models [18, 19] and graph matching methods [5, 14].
Objects are discovered and localized by selecting the most salient
regions that contain corresponding parts. To this end, we introduce a
score that measures how much a region stands out over other regions
containing it. The proposed algorithm alternates between part-based
region matching and foreground localization, improving both over
iterations.

The main contributions of this paper can be summarized as follows:
(1) A part-based region matching approach to unsupervised object
discovery is introduced. (2) An efficient and robust matching algorithm
based on a probabilistic Hough transform is proposed. (3) A standout
score for robust foreground localization is introduced. (4) Object
discovery and localization in a fully unsupervised setup is explored on
challenging benchmark datasets [16, 40].

2. Related work 

Unsupervised object discovery has long been attempted in computer
vision. Sivic et al. [48] and Russell et al. [42] apply statistical
topic discovery models. Grauman and Darrell [z ] use partial
correspondence and clustering of local features. Kim and Torralba [28]
employ a link analysis technique. Faktor and Irani [17] propose
clustering by composition. Unsupervised object discovery, however, has
proven extremely difficult “in the wild”; all of these previous
approaches have been successfully demonstrated in a restricted setting
with a few distinctive object classes, but their localization results
turn out to be far behind weakly-supervised results on challenging
benchmarks [12, 28, 51].

Given the difficulty of fully unsupervised discovery, recent work has
focused more on weakly-supervised approaches from different angles.
Cosegmentation is the problem of segmenting common foreground regions
out of a set of images. It was first introduced by Rother et al. [38],
who fuse Markov random fields with color histogram matching to segment
objects common to two images. Since then, this approach has been
improved in numerous ways [4, 6, 23, 55], and extended to handle more
general cases [7, 25, 40, 54]. Given the same type of input as
cosegmentation, colocalization seeks to localize objects with bounding
boxes instead of pixel-wise segmentations. Tang et al. [51] use the
discriminative clustering framework of [2 ] to localize common objects
in a set of noisy images, and Joulin et al. [27] extend it to
colocalization of video frames. Weakly-supervised localization
[9, 12, 35, 36, 46, 49] shares the same type of output as
colocalization, but assumes a more supervised scenario with image-level
labels that indicate whether a target object class appears in the image
or not. These labels make it possible to learn more discriminative
localization methods, e.g., by mining negative images [9]. Recent work
on discriminative patch discovery [15, 44, 50] learns mid-level visual
representations in a weakly-supervised mode, and uses them for object
recognition [15, 44] and discovery [13, 50].

Region proposals have been used in many of the methods discussed so
far, but most of them [12, 27, 28, 42, 51, 5 ] use a relatively small
number of the best proposals (typically, fewer than 100 for each image)
to form whole object hypotheses, often together with generic objectness
measures [1]. In contrast, we use a large number of region proposals
(typically, between 1000 and 4000) as primitive elements for matching
without any objectness priors. While many other approaches [7, 40, 41]
also use correspondences between image pairs to discover object
regions, they do not use an efficient part-based matching approach such
as ours. Many of them [7, 21, 40] are driven by correspondence
techniques, e.g., the SIFT flow [32], based on generic local regions.
Since semi-local or mid-level parts are crucial for representing
generic objects [18, 30], we believe segment-level regions are more
adequate for object matching and discovery. The work of Rubio et
al. [41] introduces such a segment-level matching term in their
cosegmentation formulation. Unlike ours, however, it requires a
reasonable initialization by an objectness measure [1], and does not
scale well with a large number of segments and images.

3. Proposed approach 

For unsupervised object discovery, we combine an efficient part-based
matching technique with a foreground localization scheme. In this
section we first introduce the two main components of our approach, and
then describe the overall algorithm for unsupervised object discovery.

3.1. Part-based region matching 

For part-based matching in an unsupervised setting, we use
off-the-shelf region proposals [34] as candidate regions for objects
and object parts: Diverse multi-scale proposals include meaningful
parts of objects as well as objects themselves. Let us assume that two
sets of region proposals $\mathcal{R}$ and $\mathcal{R}'$ have been
extracted from two images $\mathcal{I}$ and $\mathcal{I}'$,
respectively. Let $r = (f, l) \in \mathcal{R}$ denote a region with
feature $f$ observed at location $l$. We use 8 × 8 HOG descriptors
[10, 22] for $f$ to describe the region patches, and center position
and scale for $l$ to specify the location. A match between $r'$ and $r$
is represented by $(r, r')$. For the sake of brevity, we use the short
notations $\mathcal{D}$ for the two region sets and $m$ for a match:
$\mathcal{D} = (\mathcal{R}, \mathcal{R}')$, $m = (r, r')$ in
$\mathcal{R} \times \mathcal{R}'$.

Figure 2. Part-based region matching using bottom-up region proposals.
Panels: (a) Input images. (b) Bottom-up region proposals. (c) Top 20
region matches. (a-b) Given two images and their multi-scale region
proposals [34], the proposed matching algorithm efficiently evaluates
candidate matches between two sets of regions (2205 × 1044 regions in
this example) and produces match confidences for them. (c) Based on the
match confidence, the 20 best matches are shown by greedy mapping with
a one-to-one constraint. The confidence is color-coded in each match
(red: high, blue: low). (d) The region confidences of Eq.(4) are
visualized in the heat map representation. Common object foregrounds
tend to have higher confidences than others. (Best viewed in color.)

Our probabilistic model of the match confidence for $m$ is then
represented by $p(m \mid \mathcal{D})$. Assuming a common object
appears in $\mathcal{I}$ and $\mathcal{I}'$, let the offset $x$ denote
its pose displacement between $\mathcal{I}$ and $\mathcal{I}'$, related
to properties such as position and scale change.
$p(x \mid \mathcal{D})$ becomes the probability of the common object
being located with offset $x$. Now, the match confidence is decomposed
in a Bayesian manner:

$$p(m \mid \mathcal{D}) = \sum_{x} p(m, x \mid \mathcal{D}) = \sum_{x} p(m \mid x, \mathcal{D})\, p(x \mid \mathcal{D}) = p(m_a) \sum_{x} p(m_g \mid x)\, p(x \mid \mathcal{D}), \qquad (1)$$

where we suppose that appearance matching $m_a$ is independent of
geometry matching $m_g$ and an object location offset $x$. Appearance
likelihood $p(m_a)$ is simply computed as the similarity between $f$
and $f'$. Geometry likelihood $p(m_g \mid x)$ is estimated by comparing
displacement $l' - l$ to the given offset $x$. For $p(m_g \mid x)$, we
construct three-dimensional offset bins for translation and scale
change, and use a Gaussian distribution centered on offset $x$.

The main issue is how to estimate geometry prior
$p(x \mid \mathcal{D})$ without any information about objects and their
locations. Inspired by the generalized Hough transform [ ] and its
extensions [31, 57], we propose the Hough space score
$h(x \mid \mathcal{D})$, which is the sum of individual probabilities
$p(m, x \mid \mathcal{D})$ over all possible region matches
$m \in \mathcal{R} \times \mathcal{R}'$. The voting is done with an
initial assumption of a uniform prior over $x$:

$$h(x \mid \mathcal{D}) = \sum_{m} p(m \mid x, \mathcal{D}) = \sum_{m} p(m_a)\, p(m_g \mid x), \qquad (2)$$

which predicts a pseudo likelihood of common objects at offset $x$.
Assuming $p(x \mid \mathcal{D}) \propto h(x \mid \mathcal{D})$,¹ we
rewrite Eq.(1) to define the Hough match confidence as

$$c(m \mid \mathcal{D}) = p(m_a) \sum_{x} p(m_g \mid x)\, h(x \mid \mathcal{D}). \qquad (3)$$

Interestingly, this formulation can be seen as a combination of
bottom-up and top-down processes: The bottom-up process aggregates
individual votes into the Hough space scores (Eq.(2)), and the top-down
process evaluates each match confidence based on those scores
(Eq.(3)). We call this algorithm Probabilistic Hough Matching (PHM).
Leveraging the Hough space score as a spatial prior, it provides robust
match confidences for candidate matches. In particular, given
multi-scale region proposals, different region matches on the same
object cast votes for each other, and make all the region matches on
the object obtain high confidences. This is an efficient part-based
matching procedure with complexity of $O(nn')$, where $n$ and $n'$ are
the number of regions in $\mathcal{R}$ and $\mathcal{R}'$,
respectively. As shown in Fig. 2c, reliable matches can be obtained
when a proper mapping constraint (e.g., one-to-one, one-to-many, etc.)
is enforced on the confidence as a post-processing step.²
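
The bottom-up voting of Eq.(2) and the top-down scoring of Eq.(3) can
be illustrated with a short NumPy sketch. This is not the authors'
implementation: the function name, the dense array layout, the
unnormalized Gaussian, and the `sigma` parameter are all assumptions
made for illustration, and the dense `(n*n', B)` array it builds would
be too large for thousands of proposals per image without chunking.

```python
import numpy as np


def hough_match_confidence(appearance, offsets, bin_centers, sigma=1.0):
    """Dense sketch of Eqs. (2) and (3).

    appearance  : (n, n2) array, p(m_a) for every candidate match (r, r').
    offsets     : (n, n2, 3) array, displacement l' - l of every match
                  (assumed layout: dx, dy, dscale).
    bin_centers : (B, 3) array of offset-bin centers x (assumed discretization).
    sigma       : width of the Gaussian geometry likelihood (assumed value).
    """
    n, n2 = appearance.shape
    app = appearance.reshape(-1)                           # (n*n2,)
    diff = offsets.reshape(-1, 1, 3) - bin_centers[None]   # (n*n2, B, 3)
    # Unnormalized Gaussian p(m_g | x) for every match/bin pair.
    geometry = np.exp(-0.5 * np.sum(diff ** 2, axis=2) / sigma ** 2)
    # Bottom-up voting, Eq. (2): h(x) = sum_m p(m_a) p(m_g | x).
    hough = app @ geometry                                 # (B,)
    # Top-down scoring, Eq. (3): c(m) = p(m_a) sum_x p(m_g | x) h(x).
    confidence = app * (geometry @ hough)                  # (n*n2,)
    return confidence.reshape(n, n2), hough
```

With a fixed number of offset bins, the cost of this sketch grows
linearly in the number of candidate matches, in line with the $O(nn')$
complexity stated above. Matches whose displacement agrees with a
heavily voted offset receive a higher confidence than matches with the
same appearance score but an isolated displacement.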

We define the region confidence as a max-pooled match confidence for
$r$ in $\mathcal{R}$ with respect to $\mathcal{R}'$:

$$\phi(r) = \max_{r'} c\big((r, r') \mid (\mathcal{R}, \mathcal{R}')\big), \qquad (4)$$

which derives from the best matches from $\mathcal{R}'$ to
$\mathcal{R}$ under one-to-many mapping constraints. A high region
confidence guarantees that the corresponding region has at least one
good match in consideration of both appearance and spatial consistency.
As shown in Fig. 2d, the region confidence provides a useful measure
for common regions between images, thus functioning as a driving force
in object discovery.

¹ Although Eq.(1) is a valid probabilistic model, using
$h(x \mid \mathcal{D})$ as defined by Eq.(2) is a heuristic way of
estimating $p(x \mid \mathcal{D})$ in terms of “pseudo likelihood”.
Like all its “probabilistic Hough transform” predecessors we know of
[3, 31, 33], however, it lacks a proper probabilistic interpretation.

² Here we focus on the use of match confidence for object discovery
rather than the final individual matches. For more details on PHM, see
our webpage: http://www.di.ens.fr/willow/research/PHM/.

3.2. Foreground localization 

Foreground objects do not directly emerge from part-based region
matching: A region with the highest confidence is often a salient part
of a common object, while good localization is supposed to tightly
bound the entire object region. We need a principled and unsupervised
way to tackle the intrinsic ambiguity in separating the foreground
objects from the background, which is one of the main challenges in
unsupervised object discovery. In Gestalt principles of visual
perception [39] and design [24], regions that “stand out” are more
likely to be seen as foreground. A high contrast lies between the
foreground and background, and a lower contrast between foreground
parts or background parts. Inspired by these figure/ground principles,
we evaluate a foreground score of a region by its perceptual contrast
standing out of its potential backgrounds. To measure the contrast, we
leverage the region confidence from part-based matching, which is well
supported by the work of Peterson and Gibson, demonstrating the role of
object recognition or matching in the figure/ground process [3 ].

First, we generalize the notion of the region confidence to exploit
multiple images. Let us take $\mathcal{I}$ as a target image, and
$\mathcal{I}'$ as a source image. The region confidence of Eq.(4) is a
function of region $r$ in target $\mathcal{R}$ with its best
correspondence $r'$ in source $\mathcal{R}'$ as a latent variable.
Given multiple source images, it can be naturally extended with more
latent variables, meaning the best correspondences from the source
images to $r$. Let us define neighbor images $N$ of target image
$\mathcal{I}$ as an index set of source images where an object in
$\mathcal{I}$ may appear. Generalizing Eq.(4), the region confidence
can be rewritten as

$$\psi(r) = \max_{\{r'_i\}_{i \in N}} \sum_{i \in N} c\big((r, r'_i) \mid (\mathcal{R}, \mathcal{R}'_i)\big) = \sum_{i \in N} \max_{r'} c\big((r, r') \mid (\mathcal{R}, \mathcal{R}'_i)\big), \qquad (5)$$

where $\mathcal{R}'_i$ denotes the region set of source image
$i \in N$, and which reduces to the aggregated confidence from the
neighbor images. More images may give better confidences.
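
Because the maximum in Eq.(5) separates over the neighbor images, the
aggregated region confidence is simply a sum of per-neighbor max-pooled
confidences. The following minimal sketch assumes that the match
confidences of Eq.(3) are already available as one matrix per neighbor
image; the function names are illustrative and not taken from the
paper.

```python
import numpy as np


def region_confidence(match_conf):
    """Eq. (4): max-pool an (n, n') match-confidence matrix over r'."""
    return np.asarray(match_conf).max(axis=1)


def aggregated_region_confidence(match_confs):
    """Eq. (5): sum the max-pooled confidences over all neighbor images.

    match_confs: list of (n, n'_i) arrays, one per neighbor image i in N;
    the number of columns may differ between neighbors.
    """
    return np.sum([region_confidence(c) for c in match_confs], axis=0)
```

Each row of a matrix corresponds to one region of the target image, so
the result has one aggregated confidence per target region regardless
of how many proposals each neighbor image contributes.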

Given regions $\mathcal{R}$ with their region confidences, we evaluate
a perceptual contrast for region $r \in \mathcal{R}$ by computing the
increment of its confidence from its potential backgrounds. To this
end, we define the standout score as

$$s(r) = \psi(r) - \max_{r_b \in B(r)} \psi(r_b), \quad \text{s.t.} \quad B(r) = \{\, r_b \mid r \subset r_b,\ r_b \in \mathcal{R} \,\}, \qquad (6)$$

where $r \subset r_b$ means region $r$ is contained in region $r_b$.
The idea is illustrated in Fig. 3b. Imagine a region gradually
shrinking from a whole image region, to a tight object region, to a
part region. A significant increase in region confidence is most likely
to occur at the point of taking the tight object region. In practice,
we decide the inclusive relation $r \subset r_b$ by two simple
criteria: (1) The box area of $r$ is less than 50% of the box area of
$r_b$. (2) 80% of the box area of $r$ overlaps with the box area of
$r_b$.

The standout score reflects the principle that we perceive a lower
contrast between parts of the foreground than that between the
background and the foreground. As shown in the example of Fig. 3b, we
can localize potential object regions by selecting regions with top
standout scores.

Figure 3. Foreground localization. (a) Region confidences with respect
to multiple source images: Given multiple source images with common
objects, region confidences can be computed according to Eq.(5). More
source images may give better region confidences. (b) Measuring the
standout score from the region confidences: Given regions (boxes) on
the left, the standout score of Eq.(6) for the red box corresponds to
the difference between its confidence and the maximum confidence of
boxes containing the red box (green boxes). In the same way, the
standout score for the white box takes into account blue, red, and
green boxes altogether as its potential backgrounds. Three boxes on the
right are ones with the top three standout scores from the region
confidence. The red one has the top score. (Best viewed in color.)
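
The standout score of Eq.(6), together with the two containment
criteria, can be sketched as follows. The box format
`(x1, y1, x2, y2)`, the function names, and the choice to treat a
region with no containing region as having a background confidence of
0 are assumptions of this sketch; the text does not specify them.

```python
import numpy as np


def box_area(box):
    """Area of a box given as (x1, y1, x2, y2)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def is_contained(r, rb, max_area_ratio=0.5, min_overlap=0.8):
    """Containment test r in rb using the two criteria from the text."""
    iw = min(r[2], rb[2]) - max(r[0], rb[0])
    ih = min(r[3], rb[3]) - max(r[1], rb[1])
    inter = max(0.0, iw) * max(0.0, ih)
    area_r = box_area(r)
    # (1) r covers less than 50% of rb's area; (2) 80% of r lies inside rb.
    return (area_r > 0
            and area_r < max_area_ratio * box_area(rb)
            and inter >= min_overlap * area_r)


def standout_scores(boxes, psi):
    """Eq. (6): confidence of r minus the best confidence among B(r)."""
    scores = np.empty(len(boxes))
    for i, r in enumerate(boxes):
        background = [psi[j] for j, rb in enumerate(boxes)
                      if j != i and is_contained(r, rb)]
        # Assumption: with no containing region, the background term is 0.
        scores[i] = psi[i] - (max(background) if background else 0.0)
    return scores
```

For a whole-image box with confidence 0.1, an object box with
confidence 1.0, and a part box with confidence 1.2, the part has the
highest confidence but the object box has the highest standout score,
which is the behavior the figure/ground argument above calls for. The
pairwise loop is quadratic in the number of boxes and is written for
clarity rather than speed.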

3.3. Object discovery algorithm 

For unsupervised object discovery, we combine part-based region
matching and foreground localization in a coordinate descent-style
algorithm. Given a collection of images $\mathcal{C}$, our algorithm
alternates between matching image pairs and re-localizing potential
object regions. Instead of matching all possible pairs over the images,
we retrieve $k$ neighbors for each image and perform part-based
matching only from those neighbor images. To make the algorithm robust
to localization failure in preceding iterations, we maintain five
potential object regions for each image. Both the neighbor images and
the potential object regions are updated over iterations.

The algorithm starts with an entire image region as an initial set of
potential object regions $O_i$ for each image $\mathcal{I}_i$, and
performs the following three steps at each iteration.

Neighbor image retrieval. For each image $\mathcal{I}_i$, $k$ nearest
neighbor images $\{\mathcal{I}_j \mid j \in N_i\}$ are retrieved based
on the similarity between $O_i$ and $O_j$. We use 10 neighbor images
($k = 10$).³ At the first iteration, since the potential object regions
are entire image regions, nearest-neighbor matching with the GIST
descriptor [52] is used. From the second iteration, we perform PHM with
re-localized object regions. For efficiency, we only use the top 20
region proposals according to region confidences, which are contained
in the potential object regions. The similarity for retrieval is
computed as the sum of those region confidences.

Part-based region matching. Part-based matching by PHM is performed on
$\mathcal{I}_i$ from its neighbor images
$\{\mathcal{I}_j \mid j \in N_i\}$. To exploit current localization in
a robust way, an asymmetric matching strategy is adopted: We use all
region proposals in $\mathcal{I}_i$, whereas for the neighbor image
$\mathcal{I}_j$ we take regions only contained in potential object
regions $O_j$. This matching strategy does not restrict potential
object regions in target $\mathcal{I}_i$ while effectively utilizing
localized object regions at the preceding step.

Foreground localization. For each image $\mathcal{I}_i$, the standout
scores are computed so that the set of potential object regions $O_i$
is updated to that of regions with top standout scores. This
re-localization improves both neighbor image retrieval and region
matching at the subsequent iteration.

These steps are repeated for a few iterations until near-convergence.
As will be shown in our experiments, 5 iterations are sufficient as no
significant change occurs in more iterations. Final localization is
done by selecting the most standing-out region at the end. The
algorithm is designed based on the idea that better localization makes
better retrieval and matching, and vice versa. As each image is
independently processed at each iteration, the algorithm is easily
parallelizable in computation. Object discovery on 500 images takes
less than an hour with a 10-core desktop computer, using our current
parallel MATLAB implementation.
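
The alternation described in this section can be summarized by the
following control-flow skeleton. It is only a sketch: the two callables
`retrieve_neighbors` and `match_and_score` are hypothetical
placeholders standing in for neighbor retrieval (GIST, then PHM) and
for PHM matching followed by standout scoring, and the string
`"whole_image"` is a placeholder for the entire image region.

```python
def discover_objects(images, retrieve_neighbors, match_and_score,
                     n_iters=5, n_keep=5):
    """Skeleton of the alternating discovery loop.

    images             : list of image identifiers.
    retrieve_neighbors : hypothetical callable (i, regions, iteration)
                         -> list of neighbor identifiers.
    match_and_score    : hypothetical callable (i, neighbors, regions)
                         -> list of (region, standout_score) pairs.
    """
    # Start with the entire image as the only potential object region.
    regions = {i: ["whole_image"] for i in images}
    for iteration in range(n_iters):
        # Step 1: neighbor image retrieval from the current localization.
        neighbors = {i: retrieve_neighbors(i, regions, iteration)
                     for i in images}
        updated = {}
        for i in images:
            # Steps 2 and 3: part-based matching, then standout scoring.
            scored = match_and_score(i, neighbors[i], regions)
            ranked = sorted(scored, key=lambda rs: rs[1], reverse=True)
            # Keep the top regions; fall back to the previous ones if empty.
            updated[i] = [r for r, _ in ranked[:n_keep]] or regions[i]
        # Images are processed independently within an iteration, so the
        # new regions only become visible at the next iteration.
        regions = updated
    # Final localization: the most standing-out region of each image.
    return {i: regions[i][0] for i in images}
```

Writing the updates into a separate dictionary mirrors the statement
that each image is processed independently at each iteration, which is
what makes the inner loop straightforward to parallelize.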

4. Experimental evaluation 

The degree of supervision used in visual learning tasks varies from
strong (supervised localization [18, 20]) to weak (weakly-supervised
localization [9, 46]), very weak (colocalization [27, 51] and
cosegmentation [40]), and null (fully-unsupervised discovery). To
evaluate our approach for unsupervised object discovery, we conduct two
types of experiments: separate-class and mixed-class experiments. Our
separate-class experiments test performance of our approach in a very
weakly supervised mode. Our mixed-class experiments test object
discovery “in the wild” (in a fully-unsupervised mode), by mixing all
images of all classes in a dataset, and evaluating performance on the
whole dataset. To the best of our knowledge, this type of localization
experiment has never been fully attempted before on challenging
real-world datasets. We conduct experiments on two realistic
benchmarks, the Object Discovery [40] and the PASCAL VOC 2007 [16]
datasets, and compare the results with those of the current state of
the art in cosegmentation [29, 25, 26, 40], colocalization
[8, 12, 42, 27, 51], and weakly-supervised localization
[9, 12, 35, 36, 46, 5 ].

³ In our experiments, the use of more neighbor images does not always
improve the performance while increasing computation time.

4.1. Evaluation metrics 

The correct localization (CorLoc) metric is an evaluation metric widely
used in related work [12, 27, 46, 51], and defined as the percentage of
images correctly localized according to the PASCAL criterion:
$\frac{\mathrm{area}(b_p \cap b_{gt})}{\mathrm{area}(b_p \cup b_{gt})} > 0.5$,
where $b_p$ is the predicted box and $b_{gt}$ is the ground-truth box.
The metric is adequate for a conventional separate-class setup: As a
given image collection contains a single target class, only object
localization is evaluated per image. In a mixed-class setup, however,
we have another dimension involved: As different images may contain
different object classes, associative relations across the images need
to be evaluated. As such a metric orthogonal to CorLoc, we propose the
correct retrieval (CorRet) evaluation metric defined as follows. Given
the $k$ nearest neighbors identified by retrieval for each image,
CorRet is defined as the mean percentage of these neighbors that belong
to the same (ground-truth) class as the image itself. This measure
depends on $k$, fixed here to a value of 10. CorRet may also prove
useful in other applications that discover the underlying “topology”
(nearest-neighbor structure) of image collections. CorRet and CorLoc
metrics effectively complement each other in the mixed-class setup:
CorRet reveals how correctly an image is associated with other images,
while CorLoc measures how correctly an object is localized in the
image.
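
The two metrics can be computed as in the following sketch. The box
format `(x1, y1, x2, y2)`, the function names, and the convention that
an image counts as correctly localized if its predicted box matches any
of its ground-truth boxes are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    ih = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    inter = max(0.0, iw) * max(0.0, ih)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(pred_boxes, gt_boxes, threshold=0.5):
    """Percentage of images whose predicted box passes the PASCAL criterion.

    pred_boxes: one predicted box per image.
    gt_boxes  : one list of ground-truth boxes per image.
    """
    hits = sum(any(iou(p, g) > threshold for g in gts)
               for p, gts in zip(pred_boxes, gt_boxes))
    return 100.0 * hits / len(pred_boxes)


def corret(neighbors, labels):
    """Mean percentage of retrieved neighbors sharing the image's class.

    neighbors: for each image index, a non-empty list of neighbor indices.
    labels   : ground-truth class label per image index.
    """
    ratios = [sum(labels[j] == labels[i] for j in nbrs) / len(nbrs)
              for i, nbrs in enumerate(neighbors)]
    return 100.0 * sum(ratios) / len(ratios)
```

CorLoc only looks at the box predicted inside each image, while CorRet
only looks at which images were retrieved as neighbors, which is why
the two measures are complementary in the mixed-class setup.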

4.2. The Object Discovery dataset 

The Object Discovery dataset [40] was collected by the Bing API using
queries for airplane, car, and horse, resulting in image sets
containing outlier images without the query object. We use the 100
image subsets [40] to enable comparisons to previous state of the art
in cosegmentation and colocalization. In each set of 100 images,
airplane, car, and horse have 18, 11, and 7 outlier images,
respectively. Following [t ], we convert the ground-truth segmentations
and cosegmentation results of [29, 25, 26, 40] to localization boxes.

We conduct separate-class experiments as in [12, 51], and a mixed-class
experiment on a collection of 300 images from all the three classes.
The mixed-class image collection contains 3 classes and 36 outlier
images. Figure 4 shows the average CorLoc and CorRet over iterations,
where we see the proposed algorithm quickly improves both localization
(CorLoc) and retrieval (CorRet) in early iterations, and then
approaches a steady state. In the separate-class setup, CorRet is
always perfect because no other object class exists in the retrieval.
As we have found no significant change in both localization and
retrieval after 4-5 iterations in all our experiments, we measure all
performances of our method in this paper after 5 iterations. The
separate-class results are quantified in Table 1, and compared to those
of state-of-the-art cosegmentation [29, 25, 26] and colocalization
[40, 51] methods. Our method outperforms all of them in this setup. The
mixed-class result is in Table 2, and examples of localization are
shown in Fig. 5. Remarkably, our localization performance in the
mixed-class setup is almost the same as that in the separate-class
setup. Localized objects are visualized in red boxes with the five most
confident regions inside the object, indicating parts most contributing
to discovery. Table 2 and Fig. 4 show that our localization is robust
to noisy neighbor images retrieved from different classes.

Table 1. CorLoc (%) on separate-class Object Discovery dataset.

Methods                  Airplane   Car     Horse   Average
Kim et al. [29]          21.95       0.00   16.13   12.69
Joulin et al. [25]       32.93      66.29   54.84   51.35
Joulin et al. [26]       57.32      64.04   52.69   58.02
Rubinstein et al. [40]   74.39      87.64   63.44   75.16
Tang et al. [51]         71.95      93.26   64.52   76.58
Ours                     82.93      94.38   75.27   84.19

Table 2. Performance on mixed-class Object Discovery dataset.

Evaluation metric   Airplane   Car     Horse   Average
CorLoc              81.71      94.38   70.97   82.35
CorRet              73.30      92.00   82.80   82.70

Figure 4. Average CorLoc (left) and CorRet (right) vs. # of iterations
on the Object Discovery dataset.

Figure 5. Examples of localization on unlabeled Object Discovery
dataset. Small boxes inside the localized object box (red box)
represent the five most confident part regions. (Best viewed in color.)

Table 5. CorLoc comparison on PASCAL07-6x2.

Method                    Data used      Avg. CorLoc (%)
Chum and Zisserman [8]    P + N          33
Deselaers et al. [12]     P + N          50
Siva and Xiang [47]       P + N          49
Tang et al. [51]          P              39
Ours                      P              68
Ours (mixed-class)        unsupervised   54

Figure 6. CorRet variation on mixed-class PASCAL07-6x2. (Curves:
aeroplane, bicycle, boat, bus, horse, motorbike, and Average.)

Figure 7. Example results on mixed-class PASCAL07-6x2.


4.3. PASCAL VOC 2007 dataset 

The PASCAL VOC 2007 [16] contains realistic images of 20 object
classes. Compared to the Object Discovery dataset, it is significantly
more challenging due to considerable clutter, occlusion, and diverse
viewpoints. To facilitate a scale-level analysis and comparison to
other methods, we conduct experiments on two subsets of different
sizes: PASCAL07-6x2 and PASCAL07-all. The PASCAL07-6x2 subset [12]
consists of all images from 6 classes (aeroplane, bicycle, boat, bus,
horse, and motorbike) of the train+val dataset from the left and right
aspect each. Each of the 12 class/viewpoint combinations contains
between 21 and 50 images for a total of 463 images. For a large-scale
experiment with all classes following [9, 12, 36], we take all
train+val dataset images discarding images that only contain object
instances marked as “difficult” or “truncate”. Each of the 20 classes
contains between 49 and 1023 images for a total of 4548 images. We
refer to it as PASCAL07-all.

Experiments on PASCAL07-6x2. In the separate-class setup, we evaluate
performance for each class in Table 3, where we also analyze each
component of our method by removing it from the full version: ‘w/o MOR’
eliminates the use of multiple object regions over iterations, thus
maintaining only a single potential object region for each image.
‘w/o PHM’ substitutes PHM with appearance-based matching without any
geometric consideration. ‘w/o STO’ replaces the standout score with the
maximum confidence. As expected, we can see that the removal of each
component damages performance substantially. In particular, it clearly
shows both part-based matching (using PHM) and foreground localization
(using the standout score) are crucial for robust object discovery. In
Table 5, we quantitatively compare ours to previous results
[8, 12, 47, 51] on PASCAL07-6x2. Our method outperforms them by a large
margin. Note that our method does not incorporate any form of object
priors such as off-the-shelf objectness measures [12, 47, 51], and only
uses positive images (P) without more training data, i.e., negative
images (N) [12, 47]. For the mixed-class experiment, we run our method
on a collection of all class/view images in PASCAL07-6x2, and evaluate
its CorLoc and CorRet performance in Table 4. To better understand our
retrieval performance per class, we measure CorRet for classes
(regardless of views) in the third row, and analyze it by increasing
the numbers of iterations and neighbor images in Fig. 6. This shows
that our method achieves better localization and retrieval
simultaneously, and that the two benefit from each other. In Fig. 7, we
show example results of our mixed-class experiment on PASCAL07-6x2. In
spite of the small size of objects, some even partially occluded, our
method is able to localize instances from cluttered scenes, and
discovers confident object parts as well. From Table 5, we see that
even without using the separate-class setup, the method localizes
target objects markedly better than recent colocalization methods.

Table 3. CorLoc performance (%) on separate-class PASCAL07-6x2.

               aeroplane      bicycle        boat           bus            horse          motorbike
Method         L      R       L      R       L      R       L      R       L      R       L      R       Average
Ours (full)    62.79  71.79   77.08  62.00   25.00  32.56   66.67  91.30   83.33  86.96   82.05  70.59   67.68
Ours w/o MOR   62.79  74.36   52.08  42.00   15.91  27.91   61.90  91.30   85.42  76.09   48.72   8.82   53.94
Ours w/o PHM   39.53  38.46   54.17  60.00    6.82   9.30   42.86  73.91   68.75  82.61   33.33   2.94   42.72
Ours w/o STO   34.88   0.0     2.08   0.0     0.0    4.65    0.0    8.70   64.58  30.43    2.56   0.0    12.32

Table 4. CorLoc and CorRet performance (%) on mixed-class PASCAL07-6x2.

                 aeroplane      bicycle        boat           bus            horse          motorbike
                 L      R       L      R       L      R       L      R       L      R       L      R       Average
CorLoc           62.79  66.67   54.17  56.00   18.18  18.60   42.86  69.57   70.83  71.74   69.23  44.12   53.73
CorRet           61.40  42.56   48.75  56.80   19.09  13.02   13.33  30.87   41.46  41.74   38.72  43.24   37.58
CorRet (class)   74.39          72.35          29.43          44.32          52.66          59.04          55.36

Experiments on PASCAL07-all. Here we tackle a much more
challenging and larger-scale discovery task, using all the
images from the PASCAL07 dataset. We first report
separate-class results, and compare them to the state of the
art in weakly-supervised localization
[9, 36, 43, 47, 45, 46, 56] and colocalization [27] in
Table 6. Note that weakly-supervised methods use more
training data, i.e., negative images (N). Also note that the best
performing method [56] uses CNN features pretrained on the
ImageNet dataset [11], and thus additional supervised data
(A). Surprisingly, the performance of our method is very
close to that of the best weakly-supervised localization
method [9] that does not use such additional data.

In the mixed-class setting, we face an issue linked to the
potential presence of multiple dominant labeled
(ground-truth) objects in each image. Basically, both CorLoc
and CorRet are defined as per-image measures, e.g., CorLoc
counts an image as true if any true localization is obtained
in the image. For images with multiple class labels in the
mixed-class setup, which is the case of PASCAL07-all with
highly overlapping class labels (e.g., persons appear in
almost 1/3 of the images), CorLoc needs to be extended in a
natural manner. To measure a class-specific average CorLoc
in such a multi-label and mixed-class setup, we take all
images containing the object class and measure their average
CorLoc for the class. The upper bound of this class-specific
average CorLoc may be less than 100% because only one
localization exists for each image in our setting. To
complement this, as shown in the last column of Table 7, we
add the ‘any’-class average CorLoc, where we count an image
as true if any true localization of any class exists in the
image. A similar evaluation is also done for CorRet. Both
‘any’-class CorLoc and CorRet have an upper bound of 100%
even when images have multiple class labels, whereas those
in ‘Av.’ (average) may not. Note that the mixed-class
PASCAL07-all dataset has a very imbalanced class
distribution: the 20 classes have very different numbers of
images, from 49 (sheep) to 1023 (person). Yet, as quantified
in Table 7, our method still performs well even in this
unsupervised mixed-class setting, and its localization
performance is comparable to that in the separate-class
setup. To better understand this, we visualize in Fig. 8 a
confusion matrix of retrieved neighbor images based on the
mixed-class result, where each row corresponds to the
average retrieval ratios (%) by each class. Note that the
matrix reflects class frequency so that the person class
appears dominant. We clearly see that despite relatively low
retrieval accuracy, many of the retrieved images

Table 6. CorLoc (%) on separate-class PASCAL07-all, compared to the state of the art in weakly-supervised localization and colocalization.
Table 7. CorLoc and CorRet performance (%) on mixed-class PASCAL07-all. (See text for ‘any’).
Figure 8. Confusion matrix of retrieval on mixed PASCAL07-all.

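A confusion matrix of retrieval such as the one in Fig. 8 could be
computed along the following lines. This is only an illustrative
sketch: it assumes a single class label per image (PASCAL07-all is
in fact multi-label) and a precomputed list of retrieved neighbors
for each image; the function name `retrieval_confusion` and the
data layout are hypothetical.

```python
from collections import Counter, defaultdict


def retrieval_confusion(labels, neighbors):
    """Row-normalized confusion matrix (%) of retrieved neighbor classes.

    labels:    {image_id: class_name}; assumes one label per image.
    neighbors: {image_id: [neighbor_image_id, ...]} retrieved per image.
    Returns    {query_class: {neighbor_class: percentage}}.
    """
    counts = defaultdict(Counter)
    for image_id, retrieved in neighbors.items():
        for neighbor_id in retrieved:
            # Count the class of every neighbor under the query's class.
            counts[labels[image_id]][labels[neighbor_id]] += 1
    matrix = {}
    for query_class, row in counts.items():
        total = sum(row.values())
        matrix[query_class] = {c: 100.0 * k / total for c, k in row.items()}
    return matrix
```

Each row pools the neighbors of all images of one class, which equals
the per-image average whenever every image has the same number of
neighbors. Because rows are normalized but columns are not, frequent
classes tend to dominate the columns, as noted for the person class.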


Figure 9. Localization in an example and its neighbor images on
mixed-class PASCAL07-all. A bus is successfully localized in the
image (red dashed box) from its neighbors (10 images), which even
contain other classes (car, sofa). Boxes in the neighbors show
potential objects at the final iteration. (Best viewed in color.)


part-based approach to object discovery effectively benefits
from different but similar classes without any
class-specific supervision. Interestingly, the large gap in
retrieval performance (CorRet) relative to the 100% of the
separate-class setup has much less influence on localization
(CorLoc). Further analysis of our experiments also reveals
that in the case of an imbalanced distribution of classes, a
class with lower frequency is harder to localize than a
class with higher frequency. To see this, consider ‘the
highest’ (person, car, chair, dog, cat) and ‘the lowest’
(sheep, cow, boat, bus, diningtable) in class frequency. We
have measured how much the average performance changes
between the separate-class (clean) and mixed-class
(imbalanced) settings. The average CorLoc of ‘the highest’
only drops by 1.2%, while that of ‘the lowest’ drops by
9.4%. This clearly indicates that a class with lower class
frequency is harder to localize in the mixed-class setting.
Retrieval performance of ‘the lowest’ (CorRet 11.0%) is also
much worse than that of ‘the highest’ (CorRet 30.7%). For
more information, see our project webpage:
http://www.di.ens.fr/willow/research/objectdiscovery/.
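
The class-specific and ‘any’-class CorLoc measures described above
for the mixed-class setup can be illustrated with a minimal sketch.
It assumes one predicted box per image, ground truth given as
(class, box) pairs, and the standard PASCAL criterion of an
intersection-over-union above 0.5 for a correct localization. The
function names and data layout are hypothetical and serve
illustration only.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def mixed_class_corloc(predictions, ground_truth, threshold=0.5):
    """Return ({class: CorLoc %}, 'any'-class CorLoc %).

    predictions:  {image_id: box}, a single box per image.
    ground_truth: {image_id: [(class_name, box), ...]}, possibly multi-label.
    """
    hits, counts, any_hits = {}, {}, 0
    for image_id, objects in ground_truth.items():
        pred = predictions[image_id]
        # Classes with a ground-truth box matched by the single prediction.
        matched = {c for c, box in objects if iou(pred, box) > threshold}
        for c in {c for c, _ in objects}:
            counts[c] = counts.get(c, 0) + 1
            hits[c] = hits.get(c, 0) + (c in matched)
        # 'any'-class: the image is correct if any class is localized.
        any_hits += bool(matched)
    per_class = {c: 100.0 * hits[c] / counts[c] for c in counts}
    return per_class, 100.0 * any_hits / len(ground_truth)
```

For an image labeled with both person and car whose single box lands
on the person, the image counts as a miss for car but as a hit for
‘any’. This is why the class-specific average can have an upper bound
below 100% while the ‘any’-class measure cannot.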

5. Discussion and conclusion 

We have demonstrated unsupervised object localization in the
challenging mixed-class setup, which has never been fully
attempted before on a dataset as difficult as [16]. The
results show that the effective use of part-based matching
is a crucial factor for object discovery. In the future, we
will advance this direction and further explore how to
handle multiple object instances per image as well as build
visual models for classification and detection. In this
paper, our aim has been to evaluate our unsupervised
algorithm per se, and we have thus abstained from any form
of additional supervision such as off-the-shelf
saliency/objectness measures, negative data, and pretrained
features. We expect that the use of such information would
further improve our results.


come from other classes with partial similarity, e.g.,
bicycle - motorbike, bus - car, etc. Figure 9 shows a typical
example of such cases. These results strongly suggest that our
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